Best Practices

In this notebook, we provide best practices for using the PIP-175k dataset for training on MEVA.

1. Use framewise labels

This dataset uses multi-label activities with dense bounding box annotations. Each object may be performing zero or more activities simultaneously, and the framewise labels capture when an object is performing an activity in a given frame. This means that a person can be simultaneously performing two or more activities, such as "person_talks_on_phone" and "person_opens_facility_door". This can also arise from the MEVA annotation definitions, which can introduce overlapping activities such as "vehicle_dropping_off" and "vehicle_stopping".

We recommend using the framewise labels to export tubelets of deforming bounding boxes over time for training. Examples of extracting labels and boxes from the toolchain are shown below.
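As a minimal sketch of the framewise multi-label representation described above (using hypothetical plain tuples rather than the toolchain's own track/activity objects), the per-frame label sets can be built from activity intervals like this:

```python
def framewise_labels(activities, num_frames):
    """activities: list of (label, startframe, endframe) tuples, end exclusive.
    Returns a list mapping each frame index to the set of labels active there."""
    labels = [set() for _ in range(num_frames)]
    for (label, start, end) in activities:
        for k in range(max(0, start), min(num_frames, end)):
            labels[k].add(label)
    return labels

# A person can carry two labels in the same frame (multi-label, not one-hot)
acts = [("person_talks_on_phone", 0, 10),
        ("person_opens_facility_door", 5, 8)]
perframe = framewise_labels(acts, 12)
```

Here frames 5 through 7 carry both labels simultaneously, while frames past the last activity carry none; the toolchain's exported tubelets provide the corresponding deforming boxes per frame.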

2. Use a multi-label loss

The use of joint activity labels means that activities can occur simultaneously. A single actor can perform more than one activity at the same time, so a loss that assumes one-hot ground truth labels (e.g. categorical cross-entropy) is an inappropriate choice for training. Instead, we recommend a framewise multi-label loss that supports multiple simultaneous labels per frame (e.g. binary cross-entropy).
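The distinction above can be made concrete with a small, dependency-free sketch of binary cross-entropy over per-class logits. Each class is an independent sigmoid decision, so two positives in the same frame are a perfectly valid target (which a softmax/one-hot loss cannot express):

```python
import math

def binary_cross_entropy(logits, targets):
    """Multi-label BCE: one independent binary decision per class, so
    multiple classes may be positive in the same frame."""
    eps = 1e-12
    loss = 0.0
    for (z, t) in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))   # per-class sigmoid, not softmax
        loss += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return loss / len(logits)

# Two simultaneous positives ("talks_on_phone" and "opens_facility_door")
loss = binary_cross_entropy([4.0, 3.0, -5.0], [1, 1, 0])
```

In practice you would use your framework's built-in equivalent (e.g. a sigmoid-based multi-label loss in PyTorch or TensorFlow) rather than this reference implementation.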

3. Use your object proposals

We recommend running your proposal generation pipeline on these videos to output your own object tracks for encoding the clips for training. This ensures that the bounding box style used to encode tracks for representing activities matches what your pipeline produces at test time.

For example, the following code will run an object detector on each frame of video, and compute the intersection of the returned object detections with the ground truth using a greedy bounding box assignment based on intersection over union (IoU). You can use the resulting annotated frame (e.g. imdet.objects()) as a replacement for the ground truth when training with your own proposals.
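A minimal, self-contained sketch of such a greedy IoU assignment (plain Python, independent of the toolchain's imdet.objects() interface) might look like this:

```python
def iou(a, b):
    """Intersection over union of boxes in (xmin, ymin, xmax, ymax) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def greedy_assign(detections, groundtruth, miniou=0.5):
    """Greedily match each detection to the best unmatched ground truth box,
    highest IoU first, skipping pairs below the minimum IoU threshold."""
    pairs = sorted(((iou(d, g), i, j)
                    for (i, d) in enumerate(detections)
                    for (j, g) in enumerate(groundtruth)),
                   reverse=True)
    (used_d, used_g, matches) = (set(), set(), [])
    for (score, i, j) in pairs:
        if score >= miniou and i not in used_d and j not in used_g:
            matches.append((i, j, score))
            used_d.add(i)
            used_g.add(j)
    return matches

dets = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt = [(1, 1, 11, 11), (100, 100, 110, 110)]
matches = greedy_assign(dets, gt)   # only the first pair overlaps enough
```

The matched ground truth labels can then be transferred onto your own proposal boxes for training.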

4. Undo the MEVA Temporal Padding

The MEVA annotation requirements include class-specific temporal padding, which introduces up to five seconds of activity padding before and after each activity occurs. In order to be consistent with the MEVA annotation definitions, we have introduced the MEVA padding as a post-processing step. However, this padding can introduce label error during training, since background frames are mislabeled as the target class. Our videos were collected with tight temporal boundaries as determined by the collectors when the videos were recorded. We recommend undoing the MEVA padding and marking the padded frames with a separate framewise label. Contact us at info@visym.com and we will provide these precise framewise labels.
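To illustrate the unpadding step, here is a hypothetical sketch that shrinks a padded activity interval back toward its tight boundary. The per-class padding dictionary is an assumed placeholder (the actual class-specific values come from the MEVA annotation definitions, and the precise framewise labels are available from us as noted above):

```python
# Assumed placeholder values; real padding is class specific per the MEVA spec
PADDING_SECONDS = {"person_opens_facility_door": 1.0}

def unpad(label, startframe, endframe, fps, padding=PADDING_SECONDS):
    """Remove the temporal padding frames from each side of a padded
    (startframe, endframe) activity interval."""
    pad = int(round(padding.get(label, 0.0) * fps))
    (s, e) = (startframe + pad, endframe - pad)
    return (s, max(s, e))   # never return a negative-length interval

(s, e) = unpad("person_opens_facility_door", 30, 150, fps=30.0)
# one second (30 frames) removed from each side
```

Classes without a padding entry pass through unchanged.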

5. Use the Collector labels

Our collection platform includes additional labels that can aid in your training. We subdivide broad classes into subclasses, which provide a more challenging task for training. For example, we break out the broad class "person_puts_down_object" into "person_puts_down_object_on_shelf", "person_puts_down_object_on_floor" and "person_puts_down_object_on_table". These are visually distinct activities that can be rolled up into the single class "person_puts_down_object", but we recommend using the subclasses during training to reduce overfitting. Then, at test time, the original MEVA labels can be used.
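The train-on-subclasses, test-on-MEVA-labels strategy above amounts to a simple label rollup at test time, sketched here with the subclasses named in the text (extend the mapping to cover the other subdivided classes in the dataset):

```python
# Collapse fine-grained collector subclasses to the broad MEVA class
SUBCLASS_TO_MEVA = {
    "person_puts_down_object_on_shelf": "person_puts_down_object",
    "person_puts_down_object_on_floor": "person_puts_down_object",
    "person_puts_down_object_on_table": "person_puts_down_object",
}

def to_meva_label(label):
    """Map a subclass label to its broad MEVA class (identity otherwise)."""
    return SUBCLASS_TO_MEVA.get(label, label)
```

Train with the fine-grained labels, then apply to_meva_label() to predictions before scoring against MEVA ground truth.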

Also, the collection platform includes additional weak labels that can aid in your training. These labels are stored as metadata for each video and include:

You can access this metadata as a dictionary or by casting a video to a Collector Video object.

Our collections are organized to introduce diversity in the scene. For example, we instruct the collectors to load and unload both from a trunk and from a rear door of a vehicle, to help introduce intra-class diversity for this class. Furthermore, we specify the style of some classes, such as "talk while fidgeting", to introduce additional intra-class variation and reduce actor bias. We also separate out motorcycles and cars as separate activity classes. The full list of collection names is self-explanatory and is available as follows.

6. Use background stabilized videos

Our pipelines support optical flow based stabilization of video. This removes the artifacts due to hand-held cameras by stabilizing the background. Remaining artifacts are due to non-planar scenes, rolling shutter distortion and subpixel optical flow correspondence errors.

The pip-175k-stabilized release was constructed by running this stabilization on all videos and updating the object boxes accordingly. You can run this yourself as shown below, or use the public release. You can use the attribute "stabilize" to filter on the stabilization residual and remove those videos with too large a distortion.
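As a hypothetical sketch of that filtering step (we represent videos as plain dictionaries carrying the "stabilize" residual attribute; the threshold is an assumption you should tune for your data, not a value from the release):

```python
def filter_stabilized(videos, max_residual=0.05):
    """Keep only videos whose stabilization residual is within tolerance;
    videos lacking the attribute are dropped as unstabilized."""
    return [v for v in videos
            if v.get("stabilize") is not None and v["stabilize"] <= max_residual]

videos = [{"id": "a", "stabilize": 0.01},
          {"id": "b", "stabilize": 0.30},   # too distorted, dropped
          {"id": "c", "stabilize": 0.04}]
kept = filter_stabilized(videos)
```

With the toolchain's own video objects, the same residual would be read from the video's attributes rather than a dictionary.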

7. Export to your pipeline

You can export torch or numpy arrays, or just transcode your videos for native ingestion into your pipeline at the appropriate frame size.
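One small, concrete piece of the transcoding step is choosing the output frame size. Here is a minimal helper (an illustration, not part of the toolchain) that resizes so the short side matches your pipeline's input dimension while preserving aspect ratio, rounding to even dimensions as most video codecs require:

```python
def export_size(width, height, mindim=256):
    """Target (width, height) for transcoding so the short side equals
    mindim, preserving aspect ratio, with even dimensions for the codec."""
    scale = mindim / min(width, height)
    (w, h) = (int(round(width * scale)), int(round(height * scale)))
    return (w - w % 2, h - h % 2)

size = export_size(1920, 1080)   # short side becomes 256
```

The same target size applies whether you export torch tensors, numpy arrays, or transcoded video files.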

We recommend that you introduce data augmentation in the form of scale jittering prior to training. The MEVA dataset includes many very small people walking far from the camera. The PIP dataset does not include these tiny people, but this scale variation can be introduced by downsampling the crops to an appropriate resolution, to best match the domain shift prior to training.
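A minimal sketch of such scale jittering, under the assumption that your augmentation pipeline can resize a crop to an arbitrary target size (the scale range here is an illustrative assumption, not a recommended value):

```python
import random

def jittered_size(width, height, minscale=0.25, maxscale=1.0, rng=None):
    """Pick a random downsampling factor and return the reduced (width, height).
    Downsampling crops simulates the small, far-from-camera people in MEVA."""
    rng = rng or random.Random()
    s = rng.uniform(minscale, maxscale)
    return (max(1, int(width * s)), max(1, int(height * s)))

rng = random.Random(0)   # seeded for reproducibility in this example
sizes = [jittered_size(224, 224, rng=rng) for _ in range(3)]
```

Each training crop would then be resized down to its jittered size (and back up to the network input size, if your architecture requires a fixed resolution), so the model sees the low-resolution appearance it will encounter at test time.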